Jiang et al. BMC Proceedings 2014, 8(Suppl 1):S18 
http://www.biomedcentral.com/1753-6561/8/S1/S18 






PROCEEDINGS Open Access 



Family-based association test using normal 
approximation to gene dropping null distribution 

Yuan Jiang, Sarah Emerson, Lu Wang, Lujing Li, Yanming Di* 

From Genetic Analysis Workshop 18 
Stevenson, WA, USA. 13-17 October 2012 



Abstract 

We derive the analytical mean and variance of the score test statistic in gene-dropping simulations and 
approximate the null distribution of the test statistic by a normal distribution. We provide insights into the gene- 
dropping test by decomposing the test statistic into two components: the first component provides information 
about linkage, and the second component provides information about fine mapping under the linkage peak. We 
demonstrate our theoretical findings by applying the gene-dropping test to the simulated data set from Genetic 
Analysis Workshop 18 and comparing its performance with existing population and family-based association tests. 



Background 

When testing genotype-phenotype association using 
individuals from extended families, one has to account 
for correlations in genotypes and/or phenotypes between 
related individuals. One simple and effective method to 
account for genotype correlations is to simulate the null 
genotype distribution by gene dropping [1], which is 
simulating founder alleles according to estimated allele 
frequencies and dropping these alleles down the pedi- 
grees according to random segregation of gametes (i.e., 
Mendel's first law). The gene-dropping method is 
straightforward to implement (e.g., as implemented by 
Allen-Brady et al. [2]) and applies to all pedigree structures, 
but it is computationally intensive and thus is 
impractical to use when dealing with millions of single- 
nucleotide polymorphisms (SNPs). 
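
The gene-dropping procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [2]: the pedigree encoding, the function name `gene_drop`, and the example pedigree are assumptions made for this sketch, and the two alleles of each founder are assumed to be drawn independently.

```python
import random

def gene_drop(pedigree, maf, rng):
    """Simulate one set of genotype doses by gene dropping.

    pedigree: list of (id, father_id, mother_id), parents listed before their
              children; founders have father_id = mother_id = None.
    Returns {id: minor-allele dose in {0, 1, 2}}.
    """
    alleles = {}  # id -> (paternal allele, maternal allele); 1 = minor allele
    for ind, father, mother in pedigree:
        if father is None and mother is None:
            # Founder: draw both alleles from the estimated allele frequency.
            alleles[ind] = (int(rng.random() < maf), int(rng.random() < maf))
        else:
            # Non-founder: each parent transmits one of its two alleles at random.
            alleles[ind] = (rng.choice(alleles[father]), rng.choice(alleles[mother]))
    return {ind: sum(pair) for ind, pair in alleles.items()}

# Hypothetical three-generation pedigree (individual 6 is a grandchild of 1 and 2).
PEDIGREE = [
    (1, None, None), (2, None, None), (3, None, None),
    (4, 1, 2), (5, 1, 2), (6, 3, 4),
]
doses = gene_drop(PEDIGREE, maf=0.3, rng=random.Random(1))
```

Repeating the call many times and recomputing the test statistic on each simulated data set gives a Monte Carlo null distribution; that repetition is the step that becomes expensive for millions of SNPs.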

In this article, we derive the analytical mean and var- 
iance of the score test statistic under the gene-dropping 
setting and approximate the gene-dropping null distri- 
bution of the test statistic by a normal distribution with 
the analytically derived mean and variance. Using this 
normal approximation, the gene-dropping test becomes 
computationally efficient and can be easily applied to 
millions of SNPs. 

Furthermore, we provide insights into the gene-dropping 
test by decomposing the test statistic into two 
components: the first component resembles a quantity 
frequently used in variance-component based linkage 
tests and provides information for linkage, and the 
second component provides information for fine mapping 
under the linkage peak. Rabinowitz and Laird [3], 
among others, have pointed out the subtle distinction 
between two types of null hypotheses in family-based 
association analysis: the null hypothesis of no linkage 
and no association versus the null hypothesis of no 
association in the presence of linkage. To test the latter, 
one needs to condition on the inheritance vector at 
the test locus [3]. Our decomposition provides an explicit 
separation of linkage and association information in a 
family-based study. 

* Correspondence: diy@stat.oregonstate.edu 

Department of Statistics, Oregon State University, Corvallis, OR 97331, USA 

We compare the performance of the gene-dropping 
test (using normal approximation) to association tests 
using only unrelated individuals and to the family-based 
association test in the software program FBAT [3] by 
analyzing the Genetic Analysis Workshop 18 (GAW18) 
simulated data set. 

Methods 

Preprocessing of genotype data 

We analyzed SNPs from chromosome 3 only. At each of 
the SNPs, we performed Pearson's chi-squared test for 
the Hardy-Weinberg equilibrium using 142 unrelated 
individuals. We excluded SNPs that yielded a p-value 
smaller than 10 * from our analysis. In the gene-dropping 
test, we excluded SNPs with estimated minor allele 
frequency (MAF) smaller than 0.001. 
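
As an illustration of the Hardy-Weinberg filter, the following sketch computes Pearson's chi-squared statistic (1 degree of freedom) from genotype counts at one SNP. The function name and the genotype counts are hypothetical; the p-value uses the identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def hwe_chisq_test(n_aa, n_ab, n_bb):
    """Pearson chi-squared test of Hardy-Weinberg equilibrium at one SNP.

    Arguments are the observed counts of the three genotypes.
    Returns (chi-squared statistic, p-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expected count (monomorphic SNPs).
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chisq / 2.0))
    return chisq, p_value

# Made-up counts for 142 unrelated individuals.
chisq, p_value = hwe_chisq_test(80, 50, 12)
```

A SNP would then be dropped when `p_value` falls below the chosen exclusion threshold.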

Preprocessing of phenotype data 

We focused on the analysis of the quantitative trait systolic 
blood pressure (SBP) in simulated data set 1. The true 
simulation model was known to us [4]. When testing association 
between genotype doses and trait values (see later 
discussion), we included AGE, SEX, and the AGE by 
SEX interaction as covariates (the $Z_k$'s in equation (1)). 
Including BPMED as a covariate would overcompensate 
because BPMED is a consequence of SBP level. Instead, 
we estimated the effect of BPMED from a regression 
model fitted only to individuals with hypertension. Because 
BPMED was randomly assigned to individuals with hypertension, 
the BPMED effect estimated this way is not 
biased by its correlation with SBP. We then adjusted the 
trait values Y by subtracting the estimated BPMED effect. 
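
The adjustment can be illustrated with a deliberately simplified sketch. It assumes the regression among hypertensive individuals contains only an intercept and the BPMED indicator, in which case the least-squares slope equals the difference in group means; the text does not state which other terms the fitted model included, and the function name and data below are made up.

```python
def adjust_for_medication(sbp, on_medication, hypertensive):
    """Estimate the BPMED effect among hypertensive individuals and remove it.

    sbp: trait values; on_medication, hypertensive: 0/1 indicators.
    Returns (estimated effect, adjusted trait values).
    """
    rows = list(zip(sbp, on_medication, hypertensive))
    treated = [y for y, m, h in rows if h and m]
    untreated = [y for y, m, h in rows if h and not m]
    # With a single binary predictor, the regression slope is the
    # difference between the two group means.
    effect = sum(treated) / len(treated) - sum(untreated) / len(untreated)
    # Subtract the estimated effect from medicated individuals only.
    adjusted = [y - effect * m for y, m in zip(sbp, on_medication)]
    return effect, adjusted

effect, adjusted = adjust_for_medication(
    sbp=[140.0, 150.0, 160.0, 170.0, 120.0],
    on_medication=[1, 1, 0, 0, 0],
    hypertensive=[1, 1, 1, 1, 0],
)
```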

Score tests of genotype-phenotype association using 
unrelated individuals 

At locus $\tau$, we consider a quantitative trait model 

$$E(Y) = \mu + \sum_{k=1}^{K} \alpha_k Z_k + X_\tau \beta_\tau, \tag{1}$$

and test the null hypothesis $\beta_\tau = 0$. In equation (1), Y is 
the vector of trait values (SBP adjusted for the BPMED 
effect), $\mu$ is a constant vector of baseline mean trait 
values, coefficients $\alpha_k$ represent the effects of the covariates 
$Z_k$, $k = 1, \ldots, K$ (e.g., AGE, SEX, and AGE by SEX 
interaction) on trait values, $X_\tau$ is the vector of genotype 
doses (the number of minor alleles possessed by each 
individual) at locus $\tau$, and the coefficient $\beta_\tau$ represents 
the effect size of a single allele. The fitted value of $\beta_\tau$ will 
reflect the collective effect of all causal SNPs that are in 
linkage disequilibrium (LD) with the test SNP $\tau$ [5]. 

Let $\hat{Y}$ and $\hat{X}_\tau$ be the vectors of fitted values after 
regressing Y and $X_\tau$ on the measured covariates $Z_k$'s. 
The score statistic [6,7] for testing genotype-trait association 
at a single SNP $\tau$ is $u = X_\tau' R$, where $R = Y - \hat{Y}$ 
is the vector of residuals. Under the null hypothesis of 
no association, the variance of u is estimated by 

$$v = S_{YY}\, X_\tau' \left( X_\tau - Z(Z'Z)^{-1}Z'X_\tau \right) = S_{YY}\, X_\tau' \left( X_\tau - \hat{X}_\tau \right), \tag{2}$$

where $Z = (1, Z_1, \ldots, Z_K)$ and $S_{YY}$ is the sample variance 
of the residual trait values (1 is a vector of ones) [6]. To 
test association, $u^2/v$ is compared with a $\chi^2_1$ distribution. 
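
A minimal sketch of this score test in plain Python follows. The helper names are hypothetical, residuals are obtained by Gram-Schmidt orthogonalization rather than by forming $(Z'Z)^{-1}$ explicitly, and the sample variance $S_{YY}$ is assumed to use the divisor n - 1, which the text does not specify.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def residualize(y, columns):
    """Residuals of y after least-squares regression on the given columns."""
    basis = []  # orthogonal basis of the column space
    for col in columns:
        v = list(col)
        for b in basis:
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        if dot(v, v) > 1e-12:  # skip columns that are linearly dependent
            basis.append(v)
    r = list(y)
    for b in basis:
        coef = dot(r, b) / dot(b, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return r

def score_test(y, x, covariates):
    """Score test of beta = 0 for genotype doses x, adjusting for covariates.

    covariates: list of covariate columns (the intercept is added here).
    Returns (u, v, u^2/v, p-value from a chi-squared with 1 df).
    """
    n = len(y)
    design = [[1.0] * n] + [list(c) for c in covariates]
    r = residualize(y, design)      # R = Y - Y_hat
    x_res = residualize(x, design)  # X - X_hat
    u = dot(x, r)
    s_yy = dot(r, r) / (n - 1)      # assumed divisor n - 1
    v = s_yy * dot(x, x_res)
    stat = u * u / v
    return u, v, stat, math.erfc(math.sqrt(stat / 2.0))

# Toy data with no covariates other than the intercept.
u, v, stat, p_value = score_test([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0], [])
```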

Family-based association test by gene dropping 

When related individuals are used to compute the score 
test statistic $u = X_\tau' R$, components of $X_\tau$ can be dependent, 
and the variance estimator (2) is no longer valid. 
One can account for correlations between components 
in $X_\tau$ by simulating the null distribution of $X_\tau$ using 
gene dropping. We now derive the analytical mean and 
variance of u under the gene-dropping setting. In the 
score test using unrelated individuals, we treat R as 
random, and $X_\tau$ can be viewed as either random or 
fixed. In a gene-dropping simulation, R is held fixed, 
and $X_\tau$ is random. 

Let i, j index individuals ($i, j = 1, \ldots, n$) and let 
$X_\tau = (X_1, \ldots, X_n)'$ and $R = (R_1, \ldots, R_n)'$. The expected 
value of u is $E(u) = \sum_{i=1}^{n} E(X_i) R_i$ and $X_i = P_i + M_i$, where 
$P_i$ ($M_i$) is 1 if the paternal (maternal) allele is the 
minor allele and 0 otherwise. So $E(X_i)$ is twice the MAF 
$f_\tau$ at SNP $\tau$ and is the same for all individuals, and thus 
$E(u) = 2 f_\tau \sum_{i=1}^{n} R_i = 0$ because the $R_i$'s are residuals from a 
linear regression model with intercept. The variance of u 
is $E(u^2) = R' E(X_\tau X_\tau') R$. The (i, j)th element in $E(X_\tau X_\tau')$ 
is $E(X_i X_j) = E(P_i P_j + P_i M_j + M_i P_j + M_i M_j)$. $P_i$, $M_i$, $P_j$, $M_j$ 
are all Bernoulli random variables with probability $f_\tau$, and 
any two of them are identical if the corresponding alleles 
are identity-by-descent (IBD) and are independent otherwise [8]. 
Let $\phi_{ij}$ be the number of IBD pairs among the 
four pairs of alleles $P_i P_j$, $P_i M_j$, $M_i P_j$, $M_i M_j$. The value of 
$\phi_{ij}$ at locus $\tau$ is determined by the inheritance vector $S_\tau$, 
which summarizes whether the paternal or the maternal 
allele is passed from the parent to the child in each meiosis [9]. 
Given the inheritance vector $S_\tau$, 

$$E(X_i X_j \mid S_\tau) = \phi_{ij}(S_\tau) f_\tau + (4 - \phi_{ij}(S_\tau)) f_\tau^2 = \phi_{ij}(S_\tau)(f_\tau - f_\tau^2) + 4 f_\tau^2$$

(e.g., $E(P_i P_j) = E(P_i^2) = f_\tau$ if $P_i$ and $P_j$ correspond to 
IBD alleles and $E(P_i P_j) = E(P_i) E(P_j) = f_\tau^2$ if $P_i$ and $P_j$ 
correspond to non-IBD alleles). In a gene-dropping 
simulation, the inheritance vector $S_\tau$ is randomly 
sampled among all possible inheritance vectors. The 
expected number of IBD allele pairs shared between i and j, 
$E(\phi_{ij}(S_\tau))$, over all possible inheritance 
vectors is four times the kinship coefficient 
$\psi_{ij}$: $E(\phi_{ij}(S_\tau)) = 4\psi_{ij}$. The kinship coefficients are determined 
by pedigree structures. The expected value of 
$X_i X_j$ in a gene-dropping simulation is thus 
$E(E(X_i X_j \mid S_\tau)) = 4\psi_{ij}(f_\tau - f_\tau^2) + 4 f_\tau^2$. Letting 
$\Phi(S_\tau) = (\phi_{ij})$ be the matrix of IBD counts and $\Psi$ be 
the matrix of kinship coefficients, we can rewrite the 
above as: 

$$E(X_\tau X_\tau' \mid S_\tau) = \Phi(S_\tau)(f_\tau - f_\tau^2) + 4 J f_\tau^2,$$

$$E(X_\tau X_\tau') = E(E(X_\tau X_\tau' \mid S_\tau)) = 4\Psi(f_\tau - f_\tau^2) + 4 J f_\tau^2,$$

where J is a matrix of all ones. Because $R'JR = 0$ for 
residuals from a linear regression model with an 
intercept, the variance of u under gene dropping 
is $v_{gd} = R' E(X_\tau X_\tau') R = 4 R'\Psi R\,(f_\tau - f_\tau^2)$ if unconditional 
on the inheritance vector $S_\tau$, and is 
$v_\tau = R' E(X_\tau X_\tau' \mid S_\tau) R = R'\Phi(S_\tau) R\,(f_\tau - f_\tau^2)$ if conditional 
on the inheritance vector $S_\tau$ (holding $S_\tau$ fixed). 
We can approximate the gene-dropping null distribution 
of u by a normal distribution with mean 0 and 
variance $v_{gd}$, and compute the gene-dropping p-value 
by comparing $t = u^2/v_{gd}$ with a $\chi^2_1$ distribution. To 
test association in the presence of linkage, one needs 
to condition on the inheritance vector $S_\tau$ at $\tau$ [3] and 
use $v_\tau$. In practice, $S_\tau$ is not observable, but we estimate 
$v_\tau$ by drawing Markov chain Monte Carlo 
(MCMC) samples of $S_\tau$ based on observed genotypes 
in the pedigrees using MORGAN (http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml) [10]. 
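
The unconditional variance $4R'\Psi R\,(f_\tau - f_\tau^2)$ needs only the kinship matrix, the residuals, and the MAF. The sketch below computes kinship coefficients with the standard recursion over a pedigree listed parents-first and then evaluates the test. The pedigree encoding, function names, and numbers are assumptions made for illustration.

```python
import math

def kinship_matrix(pedigree):
    """Kinship coefficients for a pedigree listed with parents before children.

    pedigree: list of (id, father_id, mother_id); founders have None parents
    and are assumed unrelated and non-inbred.
    """
    index = {ind: k for k, (ind, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    psi = [[0.0] * n for _ in range(n)]
    for ind, father, mother in pedigree:
        i = index[ind]
        if father is None:
            psi[i][i] = 0.5
            continue
        f, m = index[father], index[mother]
        # Kinship with every earlier individual is the parental average.
        for j in range(i):
            psi[i][j] = psi[j][i] = 0.5 * (psi[f][j] + psi[m][j])
        psi[i][i] = 0.5 * (1.0 + psi[f][m])
    return psi

def gene_dropping_test(u, residuals, psi, maf):
    """Normal-approximation gene-dropping test (unconditional variance)."""
    n = len(residuals)
    quad = sum(residuals[i] * psi[i][j] * residuals[j]
               for i in range(n) for j in range(n))
    v_gd = 4.0 * quad * (maf - maf ** 2)
    stat = u * u / v_gd
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    return v_gd, stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical trio: father 1, mother 2, child 3; residuals sum to zero.
TRIO = [(1, None, None), (2, None, None), (3, 1, 2)]
psi = kinship_matrix(TRIO)
v_gd, stat, p_value = gene_dropping_test(1.0, [1.0, -2.0, 1.0], psi, maf=0.5)
```

Because the kinship matrix and the residuals do not depend on the SNP, the quadratic form can be computed once and reused across SNPs, with only u and the MAF changing.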

Results 

Theoretical findings 

In a gene-dropping simulation, the analytical mean of 
the score statistic $u = X_\tau' R$ is 0. The variance of the 
score statistic is $R'\Phi(S_\tau)R\,(f_\tau - f_\tau^2)$ if conditional on 
the inheritance vector (i.e., holding the inheritance vector 
fixed during gene-dropping simulation) and is 
$4R'\Psi R\,(f_\tau - f_\tau^2)$ if unconditional on the inheritance vector. 
The normal approximation is justified by the central 
limit theorem because the test statistic is additive over 
pedigrees. Its performance depends on the number, 
sizes, and structure of pedigrees and on MAF at the test 
locus. The approximation may not be accurate for extremely 
small p-values. However, the rankings of the 
p-values will not change. 
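
The mean and unconditional variance can be checked numerically on a small example. The sketch below runs a gene-dropping simulation on one hypothetical nuclear family (two founders, two children) with made-up residuals and compares the Monte Carlo moments of u with the analytical values; the family, residuals, MAF, seed, and number of replicates are all illustrative assumptions.

```python
import random

# Hypothetical nuclear family: founders 0 and 1, children 2 and 3.
PARENTS = {2: (0, 1), 3: (0, 1)}
# Kinship matrix of this family: 1/2 on the diagonal, 1/4 for parent-child
# and full-sib pairs, 0 for the two unrelated founders.
PSI = [
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
]
R = [1.0, -1.0, 2.0, -2.0]  # made-up residuals that sum to zero
MAF = 0.3

def simulate_u(rng):
    """One gene-dropping replicate of u = X'R for the family above."""
    alleles = {}
    for ind in range(4):
        if ind in PARENTS:
            father, mother = PARENTS[ind]
            alleles[ind] = (rng.choice(alleles[father]), rng.choice(alleles[mother]))
        else:
            alleles[ind] = (int(rng.random() < MAF), int(rng.random() < MAF))
    return sum(sum(alleles[ind]) * R[ind] for ind in range(4))

rng = random.Random(2014)
draws = [simulate_u(rng) for _ in range(20000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (len(draws) - 1)

# Analytical unconditional variance 4 R' Psi R (f - f^2).
quad = sum(R[i] * PSI[i][j] * R[j] for i in range(4) for j in range(4))
v_gd = 4.0 * quad * (MAF - MAF ** 2)
```

With a single small family the distribution of u is far from normal; the normal approximation relies on summing such contributions over many pedigrees.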

We can decompose the unconditional gene-dropping 
test statistic into two components: 

$$\frac{u^2}{4R'\Psi R\,(f_\tau - f_\tau^2)} = \frac{u^2}{R'\Phi(S_\tau)R\,(f_\tau - f_\tau^2)} \times \frac{R'\Phi(S_\tau)R}{4R'\Psi R}.$$
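
The factorization is a simple algebraic identity, which the following sketch verifies numerically. The family (two founders and two sibs who happen to share both alleles IBD), the residuals, the value of u, and the function names are made up for illustration.

```python
def quadratic_form(r, matrix):
    """Compute r' M r for a square matrix given as a list of rows."""
    n = len(r)
    return sum(r[i] * matrix[i][j] * r[j] for i in range(n) for j in range(n))

def decompose(u, r, phi, psi, maf):
    """Return (t, association component, linkage component)."""
    scale = maf - maf ** 2
    r_phi_r = quadratic_form(r, phi)
    r_psi_r = quadratic_form(r, psi)
    t = u * u / (4.0 * r_psi_r * scale)      # unconditional test statistic
    association = u * u / (r_phi_r * scale)  # conditional on IBD sharing
    linkage = r_phi_r / (4.0 * r_psi_r)      # observed versus expected sharing
    return t, association, linkage

# IBD counts for founders 0, 1 and sibs 2, 3 sharing two alleles IBD.
PHI = [[2, 0, 1, 1], [0, 2, 1, 1], [1, 1, 2, 2], [1, 1, 2, 2]]
# Kinship matrix of the same family.
PSI = [
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
]
R = [1.0, -1.0, 2.0, -2.0]

t, association, linkage = decompose(u=1.5, r=R, phi=PHI, psi=PSI, maf=0.3)
```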



The first component can be used as a test statistic for 
detecting association in the presence of linkage (i.e., fine 
mapping under a linkage peak) because the denominator 
is the variance of u conditional upon the observed IBD 
sharing. The second component provides information 
about linkage. The kinship coefficients in $\Psi$ are determined 
by pedigree structure, so $R'\Psi R$ is a constant in a 
gene-dropping simulation. $R'\Phi(S_\tau)R = \sum_{i,j} R_i R_j \phi_{ij}(S_\tau)$ 
measures the correlation between trait value similarity 
($R_i R_j$) and IBD sharing ($\phi_{ij}$) at locus $\tau$ across all pairs 
of individuals in a pedigree. This correlation is expected 
to be stronger if there is stronger linkage between $\tau$ and a 
true causal locus. Therefore, $R'\Phi(S_\tau)R$ can be used as a 
test statistic to detect linkage, with null distribution 
obtained by gene-dropping simulations. In a gene-dropping 
simulation, the inheritance vectors are simulated as if 
they were from a marker unlinked to any potential causal 
loci. $R'\Phi(S_\tau)R$ resembles similar quantities that are frequently 
used in linkage analysis methods such as the well-known 
Haseman-Elston regression [11] as well as many 
variance components or generalized estimating equation-based 
methods [12]. 
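
To make the linkage quantity concrete, the sketch below builds the IBD count matrix from a given inheritance vector by labeling founder alleles and tracing them down the pedigree, then evaluates $R'\Phi(S_\tau)R$. The encoding of the inheritance vector (one pair of 0/1 indicators per non-founder), the function names, and the example family are assumptions for illustration; in the analysis itself the inheritance vectors are sampled with MORGAN.

```python
def ibd_count_matrix(pedigree, inheritance):
    """Matrix of IBD pair counts phi_ij implied by an inheritance vector.

    pedigree: list of (id, father_id, mother_id), parents before children.
    inheritance: {child_id: (paternal bit, maternal bit)}; bit 0 means the
                 parent passed on its own paternal allele, bit 1 its maternal.
    """
    labels = {}  # id -> (label of paternal allele, label of maternal allele)
    next_label = 0
    for ind, father, mother in pedigree:
        if father is None:
            # Founder alleles get distinct labels (assumed non-IBD).
            labels[ind] = (next_label, next_label + 1)
            next_label += 2
        else:
            pat_bit, mat_bit = inheritance[ind]
            labels[ind] = (labels[father][pat_bit], labels[mother][mat_bit])
    ids = [ind for ind, _, _ in pedigree]
    # phi_ij counts matching labels among the four allele pairs of i and j.
    return [[sum(a == b for a in labels[i] for b in labels[j]) for j in ids]
            for i in ids]

def linkage_statistic(r, phi):
    """Compute R' Phi R, the sum over pairs of R_i R_j phi_ij."""
    n = len(r)
    return sum(r[i] * phi[i][j] * r[j] for i in range(n) for j in range(n))

# Hypothetical nuclear family: founders 1 and 2, sibs 3 and 4.
FAMILY = [(1, None, None), (2, None, None), (3, 1, 2), (4, 1, 2)]
R = [1.0, -1.0, 2.0, -2.0]  # made-up residuals

phi_shared = ibd_count_matrix(FAMILY, {3: (0, 0), 4: (0, 0)})  # sibs share 2
phi_none = ibd_count_matrix(FAMILY, {3: (0, 0), 4: (1, 1)})    # sibs share 0
stat_shared = linkage_statistic(R, phi_shared)
stat_none = linkage_statistic(R, phi_none)
```

In this toy family the two sibs have residuals of opposite sign, so the statistic is smaller when they share more alleles IBD.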

Simulation results 

We performed a genome-wide association study 
(GWAS) score test using 142 unrelated individuals, the 
family-based association test using FBAT [3], and the 
gene-dropping test on SNPs on chromosome 3 (FBAT 
and the gene-dropping test used 847 individuals from 20 
pedigrees). Table 1 summarizes the p-value ranks that 
each test assigns the true causal SNPs. The gene-dropping 
test for fine mapping (conditional on the inheritance 
vector) performs very similarly to the unconditional 
gene-dropping test, so its results are omitted. The 
gene-dropping tests identify a few true causal 
SNPs within a short list of top findings. However, if we 
allow more false positives by considering a greater number 
of the most significant SNPs, other methods start to pick 
up true causal SNPs and eventually have a result similar to 
gene dropping. 

Figure 1 shows the physical positions and negative log p- 
values of the top 500 SNPs identified by each of the three 
tests, as well as the negative log p-values of the linkage 
test based on the linkage component of the gene-dropping 
test statistic. 

We also examined adjusting for population stratification 
by fitting the first two principal components of genetic 
variation [13] as covariates in the regression model (1). 
The p-values resulting from this expanded model differed 
negligibly from the original model. The ranks in Table 1 
were essentially unchanged by this adjustment. 

Discussion 

Comparison between genome-wide association studies, 
FBAT and gene-dropping test 

FBAT splits each pedigree into nuclear families. In each 
nuclear family, FBAT uses information from the offspring 
while conditioning on the parental marker genotypes. 
In contrast, GWAS uses information in unrelated 
individuals. The two methods use almost "orthogonal" 
sources of information. There is almost no correlation 
between the log p-values from these two methods 
(Table 2). In contrast, the gene-dropping test applies to 
multigeneration pedigrees and uses information from all 
Table 1 Ranks of truly influential single-nucleotide polymorphisms by genome-wide association studies, FBAT, and 
gene dropping 

GWAS                                        FBAT                                        Gene dropping
Rank      Relative rank (%)  SNP position   Rank      Relative rank (%)  SNP position   Rank      Relative rank (%)  SNP position
22        0.00212            47957996       1,433     0.25642            47956424       1.5       0.00012            48040283
27        0.00260            48040283       2,903     0.51947            47958037       3.5       0.00029            47957996
1,561     0.15024            141693906      2,913     0.52126            50185967       202.5     0.01686            47958037
5,901     0.56796            47467805       5,536     0.99062            48040283       232       0.01932            47956424
11,415    1.09868            58161774       9,086.5   1.62595            47957996       3,937     0.32787            48040284
21,148.5  2.03552            47958037       15,860.5  2.83810            141093285      13,668    1.13826            58109162
23,791    2.28985            196597635      17,341.5  3.10311            141162128      19,870.5  1.65480            123170592
28,783    2.77033            135789360      22,148.5  3.96328            139276557      37,071    3.08725            141162128
30,761.5  2.96075            47956424       23,778    4.25487            141160882      40,497.5  3.37261            47913455
34,720.5  3.34180            58190853       32,483    5.81256            58192585       42,740    3.55936            141160882


Rank is the raw rank, in terms of p-value significance, of each truly influential single-nucleotide polymorphism (SNP) (smaller numbers are better, indicating that the method 
identifies a true SNP as more significant). The fractional ranks appearing in the gene-dropping column arise from ties: two SNPs being assigned exactly the same 
p-value. Note that it is not completely fair to compare these numbers directly because FBAT and genome-wide association studies (GWAS) produce not available 
(NA) results for a significant portion of the tested SNPs. Relative rank is the normalized rank of a truly influential SNP: the p-value rank divided by the total number 
of non-NA SNPs tested, multiplied by 100. SNP position is the base-pair position of the identified truly influential SNP. 
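
The rank definitions in this footnote can be illustrated with a short sketch, assuming that tied p-values receive the average of the positions they occupy (which would produce the fractional ranks) and that NA results are represented by `None`. The function name and the p-values are made up.

```python
def ranks_and_relative_ranks(p_values):
    """Average ranks (ties share a rank) and relative ranks in percent.

    p_values: list of p-values, with None marking an NA result.
    Returns (ranks, relative ranks); NA entries stay None in both lists.
    """
    tested = [k for k, p in enumerate(p_values) if p is not None]
    tested.sort(key=lambda k: p_values[k])
    ranks = [None] * len(p_values)
    start = 0
    while start < len(tested):
        end = start
        # Extend the block while the next p-value ties with the current one.
        while (end + 1 < len(tested)
               and p_values[tested[end + 1]] == p_values[tested[start]]):
            end += 1
        average = (start + end) / 2 + 1  # mean of 1-based positions in the block
        for k in tested[start:end + 1]:
            ranks[k] = average
        start = end + 1
    total = len(tested)  # number of non-NA SNPs tested
    relative = [None if r is None else 100.0 * r / total for r in ranks]
    return ranks, relative

ranks, relative = ranks_and_relative_ranks([0.01, 0.01, 0.5, None, 0.2])
```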



Figure 1 p-Values from linkage and association tests. Here we present the chromosome locations of the 500 most significant single- 
nucleotide polymorphisms (SNPs) on chromosome 3 identified by each method and their corresponding -log p-values. The triangles are the 500 
most significant SNPs identified by the genome-wide association studies (GWAS) score test using unrelated individuals; the crosses are those 
identified by FBAT, and the plusses are those identified by the gene-dropping test. The solid curve shows the -log p-values from the linkage test 
at 449 evenly spaced SNPs (by comparing the linkage component of the gene-dropping test statistic with its Monte Carlo null distribution from 
gene dropping). Solid vertical lines indicate the positions of truly influential SNPs on chromosome 3. 



Table 2 Correlation between log p-values of genome-wide association studies, FBAT, and gene dropping 

GWAS/FBAT    Gene dropping/FBAT    Gene dropping/GWAS
0.011        0.232                 0.254

GWAS, genome-wide association studies. 
individuals: the gene-dropping test extracts information 
from founders by resimulating founder genotypes and 
from offspring by resimulating inheritance vectors. 

It is also possible to derive the analytical mean and 
variance of the test statistic in the gene-dropping test 
where we permute the founder alleles rather than resi- 
mulate the founder alleles. FBAT is more robust to 
population stratification by conditioning on founder 
genotypes. The gene-dropping test can gain similar 
robustness by restricting permutations to founder alleles 
within each family. 

It is somewhat surprising that the gene-dropping test 
did not outperform GWAS given that it uses more indivi- 
duals. One possible interpretation is that the effect of LD 
is stronger when more individuals are used. As we can see 
in Figure 1, the signals detected by the gene-dropping test 
come in bigger clusters. In other words, many SNPs 
ranked high by the gene-dropping test might be in LD 
with one or more of the causal SNPs. 

Separating linkage and association signals 

The gene-dropping test captures both linkage and asso- 
ciation signals. One can decompose the test statistic into 
a linkage component and an association component. 

The association component corresponds to testing asso- 
ciation in the presence of linkage, which requires one to 
condition on the true inheritance vector at the test locus. 
Our results through MCMC approximation show that 
whether or not to condition on the inheritance vector 
actually does not make a big difference for this data set 
because the variance of the test statistic with conditioning 
only differs slightly from the variance of the test statistic 
without conditioning. This conclusion might be dependent 
on the structure of the pedigree. 

The linkage component, however, clearly provides 
valuable information. The linkage signal is stronger in 
most regions containing causal SNPs. It is obvious that 
the linkage curve can help eliminate many of the false 
association signals in this study. It would be interesting 
to investigate how to use the linkage information more 
effectively in the future. 